This notebook analyzes the quality of OSM bicycle infrastructure data for a given area. The quality assessment is intrinsic, i.e. based only on the one input data set without making use of external information. For an extrinsic quality assessment that compares the OSM data to a user-provided reference data set, see notebooks 3a and 3b.
The analysis assesses the fitness for purpose (Barron et al., 2014) of OSM data for a given area. Outcomes of the analysis can be relevant for bicycle planning and research - especially for projects that include a network analysis of bicycle infrastructure, in which case the topology of the geometries is of particular importance.
Since the assessment does not make use of an external reference data set as the ground truth, no universal claims of data quality can be made. The idea is rather to enable those working with OSM-based bicycle networks to assess whether the data are good enough for their particular use case. The analysis assists in finding potential data quality issues but leaves the final interpretation of the results to the user.
The notebook makes use of quality metrics from a range of previous projects investigating OSM/VGI data quality, such as Ferster et al. (2020), Hochmair et al. (2015), Barron et al. (2014), and Neis et al. (2012).
For a correct interpretation of some of the metrics for spatial data quality, some familiarity with the area is necessary.
In this setting, network density refers to the length of edges or number of nodes per km2. This is the usual definition of network density in spatial (road) networks, which is distinct from the structural network density known more generally in network science. Without comparing to a reference data set, network density does not in itself indicate spatial data quality. For anyone familiar with the study area, network density can however indicate whether parts of the area appear to be under- or over-mapped.
Method
The density here is based not on the geometric length of edges but on the computed length of the infrastructure. For example, a 100-meter-long bidirectional path contributes 200 meters of bicycle infrastructure. This accounts for different ways of mapping bicycle infrastructure, which can otherwise introduce large deviations in network density. With compute_network_density, the number of elements (nodes, dangling nodes, and total infrastructure length) per unit area is calculated. The density is computed twice: first for the entire study area ('global density'), then for each of the grid cells ('local density'). Both global and local densities are computed for the entire network and separately for protected and unprotected infrastructure.
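As a minimal sketch of the length convention described above (not the actual compute_network_density, whose signature is not shown here), the following plain-Python example counts a bidirectional edge twice and divides the total by the area. The function names and the edge dictionaries are illustrative assumptions.

```python
def infrastructure_length(geom_length_m, bidirectional):
    """Infrastructure length in meters: a bidirectional edge counts twice."""
    return geom_length_m * (2 if bidirectional else 1)


def density_per_km2(edges, area_km2):
    """Meters of bicycle infrastructure per square kilometer."""
    total_m = sum(
        infrastructure_length(e["length_m"], e["bidirectional"]) for e in edges
    )
    return total_m / area_km2


# Hypothetical toy data: one bidirectional path and one one-way lane
edges = [
    {"length_m": 100.0, "bidirectional": True},
    {"length_m": 250.0, "bidirectional": False},
]

global_density = density_per_km2(edges, area_km2=2.0)
print(global_density)  # (200 + 250) / 2 = 225.0
```

The same function can be applied per grid cell by passing only the edges of that cell together with the cell's area, which yields the local density.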
Interpretation
Since the analysis conducted here is intrinsic, i.e. it makes no use of external information, it cannot be known whether a low-density value is due to incomplete mapping, or due to actual lack of infrastructure in the area. However, a comparison of the grid cell density values can provide some insights, for example:
For the entire study area, there are:
- 289.28 meters of bicycle infrastructure per km2.
- 1.16 nodes in the bicycle network per km2.
- 0.24 dangling nodes in the bicycle network per km2.
- 236.44 meters of protected bicycle infrastructure per km2.
- 38.29 meters of unprotected bicycle infrastructure per km2.
- 14.55 meters of mixed protection bicycle infrastructure per km2.
In BikeDNA, protected infrastructure refers to all bicycle infrastructure that is either separated from car traffic by, for example, an elevated curb, bollards, or other physical barriers, or consists of cycle tracks that are not adjacent to a street.
Unprotected infrastructure refers to all other types of lanes that are dedicated to bicyclists but are separated from car traffic only by, e.g., a painted line on the street.
For many practical and research purposes, more information than just the presence/absence of bicycle infrastructure is of interest. Information about e.g. the width of the infrastructure, speed limits, streetlights, etc. can be of high relevance, for example when evaluating the bike friendliness of an area or an individual network segment. The presence of these tags describing attributes of the bicycle infrastructure is however highly unevenly distributed in OSM, which poses a barrier to evaluations of bikeability and traffic stress. Likewise, the lack of restrictions on how OSM features can be tagged sometimes results in conflicting tags, which can undermine the evaluation of cycling conditions.
This section includes analyses of missing tags (edges with tags that lack information), incompatible tags (edges labelled with two or more contradictory tags), and tagging patterns (the spatial variation of which tags are being used to describe bicycle infrastructure).
For the evaluation of tags, the non-simplified edges should be used to avoid issues with tags that have been aggregated in the simplification process.
The information that is required or desirable to obtain from the OSM tags depends on the use case - for example, the tag lit for a project that studies light conditions on cycle paths. The workflow below makes it possible to quickly analyze the percentage of network edges that have a value available for the tag of interest.
Method
We analyze all tags of interest as defined in the existing_tag_analysis section of config.yml. For each of these tags, analyze_existing_tags is used to compute the total number and the percentage of edges that have a corresponding tag value.
Interpretation
On the study area level, a higher percentage of existing tag values indicates in principle a higher quality of the data set. However, this is different from an estimation of whether the existing tag values are truthful. On the grid cell level, lower-than-average percentages for existing tag values can indicate a more poorly mapped area. However, the percentages are less informative for grid cells with a low number of edges: for example, if a cell contains one single edge that has a tag value for lit, the percentage of existing tag values is 100% - but given that there is only 1 data point, this is less informative than, say, a value of 80% for a cell that contains 200 edges.
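The share of edges with a tag value can be sketched as below. This is an illustrative plain-Python version, not the actual analyze_existing_tags; the function name, the edge dictionaries, and the choice to treat None and empty strings as missing are assumptions. It reports the share both by edge count and by length, since the two can differ, and returns the edge count so that small samples can be recognized.

```python
def tag_coverage(edges, tag):
    """Return (tagged_count, total_count, pct_by_count, pct_by_length)."""
    # Assumption: a tag counts as missing if absent, None, or an empty string
    tagged = [e for e in edges if e.get(tag) not in (None, "")]
    total_len = sum(e["length_m"] for e in edges)
    tagged_len = sum(e["length_m"] for e in tagged)
    pct_count = 100 * len(tagged) / len(edges) if edges else float("nan")
    pct_len = 100 * tagged_len / total_len if total_len else float("nan")
    return len(tagged), len(edges), pct_count, pct_len


# Hypothetical toy data for one grid cell
edges = [
    {"length_m": 100.0, "lit": "yes"},
    {"length_m": 300.0, "lit": None},
    {"length_m": 50.0, "lit": "no"},
    {"length_m": 50.0},
]

n_tagged, n_total, pct_count, pct_len = tag_coverage(edges, "lit")
print(f"lit: {n_tagged} out of {n_total} edges ({pct_count:.2f}%) have information.")
print(f"lit: {pct_len:.2f}% of the length has information.")
```

Reporting n_total alongside the percentage makes it easier to discount cells where a high percentage rests on only one or two edges.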
Analysing tags describing: surface - width - speedlimit - lit

- surface: 6644 out of 8479 edges (78.36%) have information.
- surface: 151 out of 199 km (75.62%) have information.
- width: 869 out of 8479 edges (10.25%) have information.
- width: 10 out of 199 km (5.01%) have information.
- speedlimit: 383 out of 8479 edges (4.52%) have information.
- speedlimit: 9 out of 199 km (4.40%) have information.
- lit: 5482 out of 8479 edges (64.65%) have information.
- lit: 142 out of 199 km (71.33%) have information.
Given that the tags in OSM data lack coherency at times and there are no restrictions in the tagging process (cf. Barron et al., 2014), incompatible tags might be present in the data set. For example, an edge might be tagged with the following two contradicting key-value pairs: bicycle_infrastructure = yes and bicycle = no.
Method
The config.yml file defines a list of incompatible key-value pairs for tags in incompatible_tags_analysis. Since there is no limitation to which tags a data set could potentially contain, the list is, by definition, non-exhaustive, and can be adjusted by the user. In the section below, check_incompatible_tags is run, which identifies all incompatibility instances for a given area, first on the study area level and then on the grid cell level.
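The check can be sketched in plain Python as follows. The rule format, the function name, and the edge dictionaries are hypothetical; the actual structure of incompatible_tags_analysis in config.yml and the signature of check_incompatible_tags may differ. The only rule used is the example pair mentioned above.

```python
# Hypothetical rule format: {(key, value): [(conflicting_key, conflicting_value), ...]}
incompatible_rules = {
    ("bicycle_infrastructure", "yes"): [("bicycle", "no")],
}


def find_incompatible(edges, rules):
    """Map each rule label to the ids of edges carrying both conflicting tags."""
    results = {}
    for (key, value), conflicts in rules.items():
        for other_key, other_value in conflicts:
            label = f"{key}={value} / {other_key}={other_value}"
            results[label] = [
                e["edge_id"]
                for e in edges
                if e.get(key) == value and e.get(other_key) == other_value
            ]
    return results


# Hypothetical toy data
edges = [
    {"edge_id": 1, "bicycle_infrastructure": "yes", "bicycle": "no"},
    {"edge_id": 2, "bicycle_infrastructure": "yes", "bicycle": "yes"},
    {"edge_id": 3, "bicycle_infrastructure": "yes"},
]

incompatible = find_incompatible(edges, incompatible_rules)
print(incompatible)
```

Running the same function on the edges of a single grid cell gives the count on the grid cell level.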
Interpretation
Incompatible tags are an undesired feature of the data set and render the corresponding data points invalid; there is no straightforward way to resolve the arising issues automatically, making it necessary to either correct the tag manually or to exclude the data point from the data set. A higher-than-average number of incompatible tags in a grid cell suggests local mapping issues.
In the entire data set, there are 0 incompatible tag combinations (of those defined in the configuration file).
ValueError: Locations is empty.

Traceback (most recent call last), condensed to the relevant frames:
- Cell In[14], line 14: plot_func.make_edgefeaturegroup(gdf=osm_edges[osm_edges["edge_id"].isin(incompatible_tags_edge_ids[key])], mycolor=pdict["basecols"][i], myweight=pdict["line_emp"], nametag="Incompatible tags: " + key, show_edges=True)
- File ~/Library/CloudStorage/OneDrive-ITU/projects/BikeDNA-usecases/src/plotting_functions.py:100, in make_edgefeaturegroup: my_line = folium.PolyLine(locations=locs, weight=myweight, color=mycolor, opacity=myalpha)
- File ~/opt/anaconda3/envs/bikedna/lib/python3.11/site-packages/folium/vector_layers.py:169, in PolyLine.__init__: super().__init__(locations, popup=popup, tooltip=tooltip)
- File ~/opt/anaconda3/envs/bikedna/lib/python3.11/site-packages/folium/vector_layers.py:119, in BaseMultiLocation.__init__: self.locations = validate_locations(locations)
- File ~/opt/anaconda3/envs/bikedna/lib/python3.11/site-packages/folium/utilities.py:101, in validate_locations: raise ValueError("Locations is empty.")
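The ValueError above is raised when an empty list of coordinates reaches folium.PolyLine, which presumably happens for every query that matches no edges. One possible guard, sketched here under the assumption that incompatible_tags_edge_ids maps each query label to a list of edge ids (as the cell above suggests), is to drop empty entries before any feature group is built. The helper name and the example key are hypothetical.

```python
def non_empty_queries(edge_ids_by_query):
    """Keep only the queries that matched at least one edge."""
    return {
        query: ids for query, ids in edge_ids_by_query.items() if len(ids) > 0
    }


# Hypothetical example: one configured rule, no matching edges
incompatible_tags_edge_ids = {"bicycle_infrastructure=yes / bicycle=no": []}

to_plot = non_empty_queries(incompatible_tags_edge_ids)
if not to_plot:
    print("No incompatible tags found; nothing to plot.")
```

Only the entries in to_plot would then be passed on to the plotting step.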
Interactive map saved at results/OSM/Belgrade/maps_interactive/tagsincompatible_osm.html
Identifying bicycle infrastructure in OSM can be tricky due to the many different ways in which the presence of bicycle infrastructure can be indicated. The OSM Wiki is a great resource for recommendations for how OSM features should be tagged, but some inconsistencies and local variations can remain. The analysis of tagging patterns makes it possible to visually explore some of the potential inconsistencies.
Regardless of how the bicycle infrastructure is defined, examining which tags contribute to which parts of the bicycle network makes it possible to visually examine patterns in tagging methods. It also helps to estimate whether some elements of the query will lead to the inclusion of too many or too few features.
Likewise, 'double tagging' where several different tags have been used to indicate bicycle infrastructure can lead to misclassifications of the data. For this reason, identifying features that are included in more than one of the queries defining bicycle infrastructure can indicate issues with the tagging quality.
Method
We first plot individual subsets of the OSM data set for each of the queries listed in bicycle_infrastructure_queries, as defined in the config.yml file. The subset defined by a query is the set of edges for which this query is True. Since several queries can be True for the same edge, the subsets can overlap. In the second step below, all overlaps between 2 or more queries are plotted, i.e. all edges that have been assigned several, potentially competing, tags.
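The overlap step can be sketched as follows, assuming each query has already been evaluated to a set of edge ids. The function name and the toy sets are illustrative and not taken from the notebook.

```python
from itertools import combinations


def query_overlaps(subsets):
    """Map each pair of query labels to the edge ids matched by both queries."""
    overlaps = {}
    for (a, ids_a), (b, ids_b) in combinations(sorted(subsets.items()), 2):
        shared = ids_a & ids_b
        if shared:
            overlaps[(a, b)] = shared
    return overlaps


# Hypothetical toy data: edge ids for which each query is True
subsets = {"A": {1, 2, 3}, "B": {3, 4}, "C": {5}, "D": {3, 5}}

overlaps = query_overlaps(subsets)
for pair, ids in overlaps.items():
    print(pair, sorted(ids))
```

Edge 3 is matched by queries A, B, and D in this example, so it appears in three pairwise overlaps; whether that is a problem depends on which tags the queries represent.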
Interpretation
The plots for each tagging type allow for a quick visual overview of different tagging patterns present in the area. Based on local knowledge, the user may estimate whether the differences in tagging types are due to actual physical differences in the infrastructure or rather an artefact of the OSM data. Next, the user can inspect overlaps between different tags; depending on the specific tags, this may or may not be a data quality issue. For example, in the case of 'cycleway:right' and 'cycleway:left', having data for both tags is valid, but other combinations such as 'cycleway'='track' and 'cycleway:left'='lane' give an ambiguous picture of what type of bicycle infrastructure is present.
- Tagging type A: highway == 'cycleway'
- Tagging type B: cycleway in ['lane','track','opposite_lane','opposite_track','designated','crossing']
- Tagging type C: cycleway_left in ['lane','track','opposite_lane','opposite_track','designated','crossing']
- Tagging type D: cycleway_right in ['lane','track','opposite_lane','opposite_track','designated','crossing']
- Tagging type E: cycleway_both in ['lane','track','opposite_lane','opposite_track','designated','crossing']
Interactive map saved at results/OSM/Belgrade/maps_interactive/taggingtypes_osm.html
Interactive map saved at results/OSM/Belgrade/maps_interactive/taggingcombinations_osm.html
This section explores the geometric and topological features of the data. These are, for example, network density, disconnected components, and dangling (degree one) nodes. It also includes exploring whether there are nodes that are very close to each other but do not share an edge - a potential sign of edge undershoots - or if there are intersecting edges without a node at the intersection, which might indicate a digitizing error that will distort routing on the network.
Due to the fragmented nature of most bicycle networks, many metrics, such as missing links or network gaps, can simply reflect the true extent of the infrastructure (Natera Orozco et al., 2020). This is different for road networks, where e.g., disconnected components could more readily be interpreted as a data quality issue. Therefore, the analysis only takes very small network gaps into account as potential data quality issues.
To compare the structure and true ratio between nodes and edges in the network, a simplified network representation which only includes nodes at endpoints and intersections was created in notebook 1a by removing all interstitial nodes.
Comparing the degree distribution for the networks before and after simplification is a quick sanity check for the simplification routine. Typically, the vast majority of nodes in the non-simplified network will be of degree two; in the simplified network, however, most nodes will have degrees other than two. Degree two nodes are retained in only two cases: if they represent a connection point between two different types of infrastructure; or if they are needed in order to avoid self-loops (edges whose start and end points are identical) or multiple edges between the same pair of nodes.
Non-simplified network (left) and simplified network (right).
Method
The degree distributions before and after simplification are plotted below.
Interpretation
Typically, the degree distribution will go from high (before simplification) to low (after simplification) counts of degree two nodes, while it will not change for all other degrees (1, or 3 and higher). Further, the total number of nodes will see a strong decline. If the simplified graph still maintains a relatively high number of degree two nodes, or if the number of nodes with other degrees changes after the simplification, this might point to issues either with the graph conversion or with the simplification process.
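The sanity check described here can be sketched in plain Python. The edge lists below are a hypothetical toy network and its hand-simplified counterpart; the helper names are illustrative and not part of the notebook.

```python
from collections import Counter


def degree_distribution(edge_list):
    """Return a Counter mapping node degree -> number of nodes with that degree."""
    degree = Counter()
    for u, v in edge_list:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree.values())


def non_two(dist):
    """Drop the degree-two entry so the remaining degrees can be compared."""
    return {deg: n for deg, n in dist.items() if deg != 2}


# Toy network: nodes 2 and 3 are interstitial (degree two)
raw = [(1, 2), (2, 3), (3, 4), (4, 5), (4, 6)]
# The same network after removing the interstitial nodes by hand
simplified = [(1, 4), (4, 5), (4, 6)]

before = degree_distribution(raw)
after = degree_distribution(simplified)

print("degree-two nodes:", before[2], "->", after[2])
print("other degrees unchanged:", non_two(before) == non_two(after))
```

If the last check is False, or if after[2] stays high, the graph conversion or the simplification step deserves a closer look.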
Simplifying the network decreased the number of edges by 94.1% and the number of nodes by 90.2%.
Dangling nodes are nodes of degree one, i.e. they have only one single edge attached to them. Most networks will naturally contain a number of dangling nodes. Dangling nodes can occur at actual dead-ends (representing a cul-de-sac) or at the endpoints of certain features, e.g. when a bicycle path ends in the middle of a street. However, dangling nodes can also occur as a data quality issue in case of over/undershoots (see next section). The number of dangling nodes in a network does to some extent also depend on the digitization method, as shown in the illustration below.
Therefore, the presence of dangling nodes is in itself not a sign of low data quality. However, a high number of dangling nodes in an area that is not known for containing many dead-ends can indicate digitization errors and problems with edge over/undershoots.
Left: Dangling nodes occur where road features end. Right: However, when separate features are joined at the end, there will be no dangling nodes.
Method
Below, a list of all dangling nodes is obtained with the help of get_dangling_nodes. Then, the network with all its nodes is plotted. The dangling nodes are shown in color; all other nodes are shown in black.
Interpretation
We recommend a visual analysis in order to interpret the spatial distribution of dangling nodes, with particular attention to areas of high dangling node density. It is important to understand where dangling nodes come from: are they actual dead-ends or digitization errors (e.g., over/undershoots)? A higher number of digitization errors points to lower data quality.
Interactive map saved at results/OSM/Belgrade/maps_interactive/danglingmap_osm.html
When two nodes in a simplified network are placed within a distance of a few meters, but do not share a common edge, it is often due to an edge over/undershoot or another digitizing error. An undershoot occurs when two features are supposed to meet, but instead are just in close proximity to each other. An overshoot occurs when two features meet and one of them extends beyond the other. See the image below for an illustration. For a more detailed explanation of over/undershoots, see the GIS Lounge website.
Left: Undershoots happen when two line features are not properly joined, for example at an intersection. Right: Overshoots refer to situations where a line feature extends too far beyond an intersecting line, rather than ending at the intersection.
Method
Undershoots: First, the length_tolerance (in meters) is defined in the cell below. Then, with find_undershoots, all pairs of dangling nodes that are at most length_tolerance apart are identified as undershoots, and the results are plotted.
Overshoots: First, the length_tolerance (in meters) is defined in the cell below. Then, with find_overshoots, all network edges that have a dangling node attached to them and a maximum length of length_tolerance are identified as overshoots, and the results are plotted.
The method for over/undershoot detection is inspired by Neis et al. (2012).
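The two rules can be sketched in plain Python as below. This is an illustration under simplifying assumptions, not the actual find_undershoots or find_overshoots: nodes carry projected coordinates in meters, edges are (u, v, length) tuples, and pairs of dangling nodes that already share an edge are skipped. All names and data are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import hypot


def dangling_nodes(edges):
    """Nodes of degree one."""
    degree = Counter()
    for u, v, _ in edges:
        degree[u] += 1
        degree[v] += 1
    return {n for n, d in degree.items() if d == 1}


def undershoot_candidates(nodes, edges, length_tolerance):
    """Pairs of dangling nodes at most length_tolerance apart, not sharing an edge."""
    connected = {frozenset((u, v)) for u, v, _ in edges}
    pairs = []
    for a, b in combinations(sorted(dangling_nodes(edges)), 2):
        (xa, ya), (xb, yb) = nodes[a], nodes[b]
        close = hypot(xa - xb, ya - yb) <= length_tolerance
        if close and frozenset((a, b)) not in connected:
            pairs.append((a, b))
    return pairs


def overshoot_candidates(edges, length_tolerance):
    """Edges no longer than length_tolerance with a dangling node attached."""
    dangling = dangling_nodes(edges)
    return [
        (u, v)
        for u, v, length in edges
        if length <= length_tolerance and (u in dangling or v in dangling)
    ]


# Hypothetical toy network, coordinates in meters
nodes = {1: (0, 0), 2: (100, 0), 3: (102, 0), 4: (200, 0), 5: (200, 2), 6: (250, 0)}
edges = [(1, 2, 100.0), (3, 4, 98.0), (4, 5, 2.0), (4, 6, 50.0)]

undershoots = undershoot_candidates(nodes, edges, length_tolerance=3)
overshoots = overshoot_candidates(edges, length_tolerance=3)
print(undershoots, overshoots)
```

Nodes 2 and 3 are two meters apart without a connecting edge, and the two-meter stub from node 4 to node 5 ends in a dangling node, so each rule flags one candidate here.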
Interpretation
Under/overshoots are not necessarily always a data quality issue - they might be instead an accurate representation of the network conditions or of the digitization strategy. For example, a cycle path might end abruptly soon after a turn, which results in an overshoot. Protected cycle paths are sometimes digitized in OSM as interrupted at intersections which results in intersection undershoots.
The interpretation of the impact of over/undershoots on data quality is context dependent. For certain applications, such as routing, overshoots do not present a particular challenge; they can, however, pose an issue for other applications such as network analysis, given that they skew the network structure. Undershoots, on the contrary, are a serious problem for routing applications, especially if only bicycle infrastructure is considered. They also pose a problem for network analysis, for example for any path-based metric, such as most centrality measures like betweenness centrality.
2 potential overshoots were identified using a length tolerance of 3 m. 2 potential undershoots were identified using a length tolerance of 3 m.
Interactive map saved at results/OSM/Belgrade/maps_interactive/underovershoots_3_3_osm.html
When two edges intersect without having a node at the intersection, and neither edge is tagged as a bridge or a tunnel, there is a clear indication of a topology error.
Method
First, with the help of check_intersection, each edge which is not tagged as either tunnel or bridge is checked for any crossing with another edge of the network. If this is the case, the edge is marked as having an intersection issue. The number of intersection issues found is printed and the results are plotted for visual analysis. The method is inspired by Neis et al. (2012).
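A minimal sketch of such a crossing check is given below. It simplifies each edge to a single straight segment and uses a standard orientation test; the real edges are polylines, and the actual implementation may work differently. The function names and data are hypothetical. Bridge and tunnel values are read with .get(), so edges that lack those attributes are simply treated as untagged.

```python
from itertools import combinations

# Tag values treated as grade-separated (illustrative selection)
GRADE_VALUES = {"yes", "Yes", True, "passage", "building_passage", "movable"}


def _orient(p, q, r):
    """Sign of the cross product (q - p) x (r - p): 1, 0, or -1."""
    val = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (val > 0) - (val < 0)


def segments_cross(a, b, c, d):
    """True if segments ab and cd cross at a point interior to both."""
    return (
        _orient(a, b, c) * _orient(a, b, d) < 0
        and _orient(c, d, a) * _orient(c, d, b) < 0
    )


def is_grade_separated(edge):
    """True if the edge is tagged as a bridge or a tunnel; missing tags count as no."""
    return edge.get("bridge") in GRADE_VALUES or edge.get("tunnel") in GRADE_VALUES


def crossing_candidates(edges):
    """Pairs of edge ids that cross without sharing an endpoint."""
    flat = [e for e in edges if not is_grade_separated(e)]
    return [
        (e1["edge_id"], e2["edge_id"])
        for e1, e2 in combinations(flat, 2)
        if segments_cross(*e1["coords"], *e2["coords"])
    ]


# Hypothetical toy data
edges = [
    {"edge_id": 1, "coords": ((0, 0), (10, 10))},
    {"edge_id": 2, "coords": ((0, 10), (10, 0))},  # crosses edge 1
    {"edge_id": 3, "coords": ((0, 5), (10, 5)), "bridge": "yes"},  # excluded
    {"edge_id": 4, "coords": ((10, 10), (20, 10))},  # shares an endpoint with 1
]

issues = crossing_candidates(edges)
print(f"{len(issues)} place(s) appear to be missing an intersection node.")
```

Because each crossing is returned once as a pair, the number of pairs equals the number of intersection issues in this sketch.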
Interpretation
A higher number of intersection issues points to lower data quality. However, a manual visual check of all intersection issues, informed by some knowledge of the area, is recommended in order to determine the origin of the intersection issues and to confirm, correct, or reject them.
AttributeError: 'GeoDataFrame' object has no attribute 'tunnel'

Traceback (most recent call last), condensed to the relevant frames:
- Cell In[33], line 1: missing_nodes_edge_ids, edges_with_missing_nodes = eval_func.find_missing_intersections(osm_edges, "edge_id")
- File ~/Library/CloudStorage/OneDrive-ITU/projects/BikeDNA-usecases/src/evaluation_functions.py:470, in find_missing_intersections: edges.tunnel.isin(["yes", "Yes", True, "passage", "building_passage", "movable"])
- File ~/opt/anaconda3/envs/bikedna/lib/python3.11/site-packages/pandas/core/generic.py:5902, in NDFrame.__getattr__: return object.__getattribute__(self, name)
Disconnected components do not share any elements (nodes/edges). In other words, there is no network path that could lead from one disconnected component to the other. As mentioned above, most real-world networks of bicycle infrastructure do consist of many disconnected components (Natera Orozco et al., 2020). However, when two disconnected components are very close to each other, it might be a sign of a missing edge or another digitizing error.
Method
First, with the help of return_components, a list of all (disconnected) components of the network is obtained. The total number of components is printed and all components are plotted in different colors for visual analysis. Next, the component size distribution (with components ordered by the network length they contain) is plotted, followed by a plot of the largest connected component.
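The component step can be sketched without any graph library by using a small union-find structure. This is an illustrative stand-in for return_components, with hypothetical names and toy data; edges are (u, v, length) tuples.

```python
from collections import defaultdict


def component_lengths(edges):
    """Sum edge lengths per connected component; largest component first."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Merge the components of the two endpoints of every edge
    for u, v, _ in edges:
        parent[find(u)] = find(v)

    # Accumulate edge lengths under each component's root node
    lengths = defaultdict(float)
    for u, _, length in edges:
        lengths[find(u)] += length
    return sorted(lengths.values(), reverse=True)


# Hypothetical toy network with three disconnected components
edges = [(1, 2, 100.0), (2, 3, 50.0), (4, 5, 30.0), (6, 7, 20.0)]

lengths = component_lengths(edges)
largest_share = lengths[0] / sum(lengths)
print(f"The network has {len(lengths)} disconnected components.")
print(f"The largest component contains {largest_share:.2%} of the network length.")
```

The sorted list of lengths is also the input needed for the component size distribution.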
Interpretation
As with many of the previous analysis steps, knowledge of the area is crucial for a correct interpretation of component analysis. Provided that the data represents the actual infrastructure accurately, bigger components indicate coherent network parts, while smaller components indicate scattered infrastructure (e.g., one single bicycle path along a street that does not connect to any other bicycle infrastructure). A high number of disconnected components in close vicinity of each other indicates digitization errors or missing data.
The network in the study area has 24 disconnected components.
The distribution of all network component lengths can be visualized in a so-called Zipf plot, which orders the lengths of each component by rank, showing the largest component's length on the left, then the second largest component's length, etc., until the smallest component's length on the right. When a Zipf plot follows a straight line in log-log scale, it means that there is a much higher chance to find small disconnected components than expected from traditional distributions (Clauset et al., 2009). This can mean that there has been no consolidation of the network, only piece-wise or random additions (Szell et al., 2022), or that the data itself suffers from many gaps and topology errors resulting in small disconnected components.
However, it can also happen that the largest connected component (the leftmost marker in the plot at rank $10^0$) is a clear outlier, while the rest of the plot follows a different shape. This can mean that at the infrastructure level, most of the infrastructure has been connected to one large component, and that the data reflects this - i.e. the data is not suffering from gaps and missing links to a large extent.
Bicycle networks might also be somewhere in between, with several large components as outliers.
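The rank ordering behind a Zipf plot can be sketched as follows: components are sorted by length, ranked from 1, and both axes are log-transformed so that a straight line would indicate the pattern discussed above. The function name and the toy lengths are illustrative; plotting is left out to keep the sketch free of dependencies.

```python
from math import log10


def zipf_points(component_lengths):
    """Return (log10(rank), log10(length)) pairs, largest component first."""
    ordered = sorted(component_lengths, reverse=True)
    return [
        (log10(rank), log10(length))
        for rank, length in enumerate(ordered, start=1)
    ]


# Hypothetical component lengths in meters
points = zipf_points([10.0, 1000.0, 100.0])
for x, y in points:
    print(f"{x:.3f}  {y:.3f}")
```

The first point sits at log10(rank) = 0, which is the leftmost marker at rank $10^0$; a large vertical jump between the first and second point would mark the largest component as an outlier.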
The largest connected component contains 77.40% of the network length.
In the plot of potential missing links between components, all edges that are within the specified distance of an edge on another component are plotted. The gaps between disconnected edges are highlighted with a marker. The map thus highlights edges which, despite being in close proximity of each other, are disconnected and where it thus would not be possible to bike on cycling infrastructure between the edges.
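A simplified version of this gap search is sketched below. It measures distances between nodes rather than between edges, which is an assumption made to keep the example short; it also expects a precomputed mapping from node to component. The names and data are hypothetical.

```python
from itertools import combinations
from math import hypot


def component_gaps(nodes, component_of, threshold_m):
    """Node pairs on different components that are at most threshold_m apart."""
    gaps = []
    for a, b in combinations(sorted(nodes), 2):
        if component_of[a] == component_of[b]:
            continue  # same component: already connected by some path
        dist = hypot(nodes[a][0] - nodes[b][0], nodes[a][1] - nodes[b][1])
        if dist <= threshold_m:
            gaps.append((a, b, dist))
    return gaps


# Hypothetical toy data: projected coordinates in meters and component labels
nodes = {1: (0, 0), 2: (50, 0), 3: (56, 8), 4: (300, 0)}
component_of = {1: 0, 2: 0, 3: 1, 4: 2}

gaps = component_gaps(nodes, component_of, threshold_m=10)
print(gaps)
```

The pairwise loop is quadratic in the number of nodes; for a real network a spatial index would be the more practical choice.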
Running analysis with component distance threshold of 10 meters.
Interactive map saved at results/OSM/Belgrade/maps_interactive/component_gaps_10_osm.html
Here we visualize differences between how many cells can be reached from each cell. This is a crude measure for network connectivity but has the benefit of being computationally cheap and thus able to quickly highlight stark differences in network connectivity.
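One possible way to compute such a measure is sketched below, under the assumption that a cell counts as reachable from another cell if some connected component has edges in both. This definition, the function name, and the input format are assumptions for illustration and may differ from what the notebook does.

```python
from collections import defaultdict


def cells_reached(edge_cells):
    """
    edge_cells: iterable of (component_id, cell_id) pairs, one per edge-cell overlap.
    Returns a dict mapping each cell to the number of other cells reachable from it.
    """
    cells_by_component = defaultdict(set)
    for component, cell in edge_cells:
        cells_by_component[component].add(cell)

    reachable = defaultdict(set)
    for cells in cells_by_component.values():
        for cell in cells:
            reachable[cell] |= cells

    # Do not count the cell itself
    return {cell: len(others - {cell}) for cell, others in reachable.items()}


# Hypothetical toy data: component 0 spans cells a-c, component 1 spans c-d
edge_cells = [(0, "a"), (0, "b"), (0, "c"), (1, "c"), (1, "d"), (2, "e")]

reach = cells_reached(edge_cells)
print(reach)
```

Cell c touches two components and therefore reaches the most cells, while cell e contains only an isolated component and reaches none.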
| Metric | Value |
|---|---|
| Total infrastructure length (km) | 112 |
| Protected bicycle infrastructure density (m/km2) | 236 |
| Unprotected bicycle infrastructure density (m/km2) | 38 |
| Mixed protection bicycle infrastructure density (m/km2) | 15 |
| Bicycle infrastructure density (m/km2) | 289 |
| Nodes | 450 |
| Dangling nodes | 95 |
| Nodes per km2 | 1 |
| Dangling nodes per km2 | 0 |
| Incompatible tag combinations | 0 |
| Overshoots | 2 |
| Undershoots | 2 |
| Components | 24 |
| Length of largest component (km) | 85 |
| Largest component's share of network length | 77% |
| Component gaps | 7 |